Introduction¶
Welcome to Boston, Massachusetts in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before work starts. You are tasked with building a model that can provide a price estimate based on a home's characteristics like:
- The number of rooms
- The distance to employment centres
- How rich or poor the area is
- How many students there are per teacher in local schools etc
To accomplish your task you will:
- Analyse and explore the Boston house price data
- Split your data for training and testing
- Run a Multivariable Regression
- Evaluate your model's coefficients and residuals
- Use data transformation to improve your model performance
- Use your model to estimate a property price
Upgrade plotly (only Google Colab Notebook)¶
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
# %pip install --upgrade plotly
Import Statements¶
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# TODO: Add missing import statements
Notebook Presentation¶
pd.options.display.float_format = '{:,.2f}'.format
Load the Data¶
The first column in the .csv file just has the row numbers, so it will be used as the index.
data = pd.read_csv('boston.csv', index_col=0)
Understand the Boston House Price Dataset¶
Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.
:Attribute Information (in order):
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. PRICE Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. You can find the original research paper here.
Preliminary Data Exploration 🔎¶
Challenge
- What is the shape of data?
- How many rows and columns does it have?
- What are the column names?
- Are there any NaN values or duplicates?
data.shape
(506, 14)
data.sample(3)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 113 | 0.22 | 0.00 | 10.01 | 0.00 | 0.55 | 6.09 | 95.40 | 2.55 | 6.00 | 432.00 | 17.80 | 396.90 | 17.09 | 18.70 |
| 265 | 0.76 | 20.00 | 3.97 | 0.00 | 0.65 | 5.56 | 62.80 | 1.99 | 5.00 | 264.00 | 13.00 | 392.40 | 10.45 | 22.80 |
| 294 | 0.08 | 0.00 | 13.92 | 0.00 | 0.44 | 6.01 | 42.30 | 5.50 | 4.00 | 289.00 | 16.00 | 396.90 | 10.40 | 21.70 |
data.duplicated().values.any()
False
data.describe()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 |
| mean | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 | 22.53 |
| std | 8.60 | 23.32 | 6.86 | 0.25 | 0.12 | 0.70 | 28.15 | 2.11 | 8.71 | 168.54 | 2.16 | 91.29 | 7.14 | 9.20 |
| min | 0.01 | 0.00 | 0.46 | 0.00 | 0.39 | 3.56 | 2.90 | 1.13 | 1.00 | 187.00 | 12.60 | 0.32 | 1.73 | 5.00 |
| 25% | 0.08 | 0.00 | 5.19 | 0.00 | 0.45 | 5.89 | 45.02 | 2.10 | 4.00 | 279.00 | 17.40 | 375.38 | 6.95 | 17.02 |
| 50% | 0.26 | 0.00 | 9.69 | 0.00 | 0.54 | 6.21 | 77.50 | 3.21 | 5.00 | 330.00 | 19.05 | 391.44 | 11.36 | 21.20 |
| 75% | 3.68 | 12.50 | 18.10 | 0.00 | 0.62 | 6.62 | 94.07 | 5.19 | 24.00 | 666.00 | 20.20 | 396.23 | 16.96 | 25.00 |
| max | 88.98 | 100.00 | 27.74 | 1.00 | 0.87 | 8.78 | 100.00 | 12.13 | 24.00 | 711.00 | 22.00 | 396.90 | 37.97 | 50.00 |
Data Cleaning - Check for Missing Values and Duplicates¶
data.isna().values.any()
False
data.isnull().values.any()
False
Descriptive Statistics¶
Challenge
- How many students are there per teacher on average?
- What is the average price of a home in the dataset?
- What is the CHAS feature?
- What are the minimum and the maximum value of CHAS and why?
- What is the maximum and the minimum number of rooms per dwelling in the dataset?
data.PTRATIO.describe() # pupil to teacher ratio
count   506.00
mean     18.46
std       2.16
min      12.60
25%      17.40
50%      19.05
75%      20.20
max      22.00
Name: PTRATIO, dtype: float64
# average = 18.46, so there are about 18 students per teacher
data['PRICE'].describe()
count   506.00
mean     22.53
std       9.20
min       5.00
25%      17.02
50%      21.20
75%      25.00
max      50.00
Name: PRICE, dtype: float64
data['PRICE'].mean() * 1000 # average price of a house in $
22532.806324110676
# CHAS is a 0/1 dummy variable: 1 if the tract borders the Charles River, 0 otherwise
data['RM'].describe()
count   506.00
mean      6.28
std       0.70
min       3.56
25%       5.89
50%       6.21
75%       6.62
max       8.78
Name: RM, dtype: float64
# min = 3.56, max = 8.78. RM is the AVERAGE number of rooms per dwelling, so the values are not whole numbers
Visualise the Features¶
Challenge: Having looked at some descriptive statistics, visualise the data for your model. Use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE) for the following variables:
- PRICE: The home price in thousands.
- RM: the average number of rooms per owner unit.
- DIS: the weighted distance to the 5 Boston employment centres i.e., the estimated length of the commute.
- RAD: the index of accessibility to highways.
Try setting the aspect parameter to 2 for a better picture.
What do you notice in the distributions of the data?
sns.displot(data=data, x="PRICE", aspect=2, kde=True)
sns.displot(data=data, x="RM", aspect=2, kde=True)
sns.displot(data=data, x="DIS", aspect=2, kde=True)
sns.displot(data=data, x="RAD", aspect=2, kde=True)
<seaborn.axisgrid.FacetGrid at 0x7b331e9eb2e0>
# The index of accessibility to highways was quite surprising: it looks like there is
# some kind of factory district or something similar
# 6-7 rooms on average? That is huge; are these standalone houses?
# Prices are not surprising, but I assume there are some low-cost and high-cost
# areas
# Separate plots are below
House Prices 💰¶
sns.displot(data=data, x="PRICE",aspect=2,kde=True,)
<seaborn.axisgrid.FacetGrid at 0x7b331c7e4b80>
Distance to Employment - Length of Commute 🚗¶
sns.displot(data=data, x="DIS",aspect=2,kde=True,)
<seaborn.axisgrid.FacetGrid at 0x7b331c711e70>
Number of Rooms¶
sns.displot(data=data, x="RM",aspect=2,kde=True,)
<seaborn.axisgrid.FacetGrid at 0x7b331c8115d0>
Access to Highways 🛣¶
sns.displot(data=data, x="RAD",aspect=2,kde=True,)
<seaborn.axisgrid.FacetGrid at 0x7b331c5e3100>
Next to the River? ⛵️¶
Challenge
Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it. The bar chart should look something like this:
You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes'])
data['CHAS'].value_counts()
0.00    471
1.00     35
Name: CHAS, dtype: int64
bar = px.bar(x=['No', 'Yes'],
y=data['CHAS'].value_counts(),
color=data['CHAS'].value_counts(),
)
bar.show()
Understand the Relationships in the Data¶
Run a Pair Plot¶
Challenge
There might be some relationships in the data that we should know about. Before you run the code, make some predictions:
- What would you expect the relationship to be between pollution (NOX) and the distance to employment (DIS)?
- What kind of relationship do you expect between the number of rooms (RM) and the home value (PRICE)?
- What about the amount of poverty in an area (LSTAT) and home prices?
Run a Seaborn .pairplot() to visualise all the relationships at the same time. Note, this is a big task and can take 1-2 minutes! After it's finished check your intuition regarding the questions above on the pairplot.
# higher distance == lower pollution
# more rooms == higher price, but not in the suburbs
# higher LSTAT == lower home prices in the area
sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x7b331b947610>
Challenge
Use Seaborn's .jointplot() to look at some of the relationships in more detail. Create a jointplot for:
- DIS and NOX
- INDUS vs NOX
- LSTAT vs RM
- LSTAT vs PRICE
- RM vs PRICE
Try adding some opacity or alpha to the scatter plots using keyword arguments under joint_kws.
Distance from Employment vs. Pollution¶
Challenge:
Compare DIS (Distance from employment) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as the distance increases?
sns.jointplot(data=data,
x='DIS',
y='NOX',
hue='CHAS',
)
<seaborn.axisgrid.JointGrid at 0x7b331c50ead0>
Proportion of Non-Retail Industry 🏭🏭🏭 versus Pollution¶
Challenge:
Compare INDUS (the proportion of non-retail industry i.e., factories) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as there is a higher proportion of industry?
sns.jointplot(data=data,
x='INDUS',
y='NOX',
hue='CHAS',
)
<seaborn.axisgrid.JointGrid at 0x7b33149bfca0>
% of Lower Income Population vs Average Number of Rooms¶
Challenge
Compare LSTAT (proportion of lower-income population) with RM (number of rooms) using Seaborn's .jointplot(). How does the number of rooms per dwelling vary with the poverty of area? Do homes have more or fewer rooms when LSTAT is low?
sns.jointplot(data=data,
x='LSTAT',
y='RM',
hue='CHAS',
)
<seaborn.axisgrid.JointGrid at 0x7b330f118af0>
% of Lower Income Population versus Home Price¶
Challenge
Compare LSTAT with PRICE using Seaborn's .jointplot(). How does the proportion of the lower-income population in an area affect home prices?
sns.jointplot(data=data,
x='LSTAT',
y='PRICE',
hue='CHAS',
)
<seaborn.axisgrid.JointGrid at 0x7b330f00cd00>
Number of Rooms versus Home Value¶
Challenge
Compare RM (number of rooms) with PRICE using Seaborn's .jointplot(). You can probably guess how the number of rooms affects home prices. 😊
sns.jointplot(data=data,
x='RM',
y='PRICE',
hue='DIS',
)
<seaborn.axisgrid.JointGrid at 0x7b330ede57e0>
# The 6-room homes close to employment centres are the cheapest???
# The wealthiest people like to live next to the river
Split Training & Test Dataset¶
We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.
Challenge
- Import the train_test_split() function from sklearn
- Create 4 subsets: X_train, X_test, y_train, y_test
- Split the training and testing data roughly 80/20.
- To get the same random split every time you run your notebook use random_state=10. This helps us get the same results every time and avoid confusion while we're learning.
Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.
X = data.iloc[:,:-1]
y=data.iloc[:,-1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,
test_size=0.2,
random_state=10)
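To make the 80/20 split concrete, here is a minimal standard-library sketch of a shuffled split on row indices. It is not scikit-learn's implementation and will not pick the same rows as train_test_split() with random_state=10; the helper name split_indices and the choice to round the test share up are assumptions of this sketch.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=10):
    """Shuffle row indices and cut them into train and test parts (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    n_test = math.ceil(n_rows * test_size)    # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(506)
print(len(train_idx), len(test_idx))
```

With 506 rows this sketch gives 404 training rows and 102 test rows, and no row ends up in both sets.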
Multivariable Regression¶
In a previous lesson, we had a linear model with only a single feature (our movie budgets). This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:
$$ PR \hat ICE = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta _3 DIS + \theta _4 CHAS ... + \theta _{13} LSTAT$$
Run Your First Regression¶
Challenge
Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data?
from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(X_train,y_train)
LinearRegression()
regression.score(X_test,y_test) # r-squared on the test data; pass (X_train, y_train) for the training score
0.6709339839115642
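The number returned by .score() is the r-squared (coefficient of determination). As a rough illustration of what it measures, the sketch below computes it by hand on made-up prices; the function name r_squared and the sample values are illustrative only and are not taken from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Made-up prices (in $1000s) and predictions, purely for illustration
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(round(r_squared(actual, predicted), 3))
```

Perfect predictions give 1.0, while always predicting the mean price gives 0.0.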
Evaluate the Coefficients of the Model¶
Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative).
Challenge Print out the coefficients (the thetas in the equation above) for the features. Hint: You'll see a nice table if you stick the coefficients in a DataFrame.
- We already saw that RM on its own had a positive relation to PRICE based on the scatter plot. Is RM's coefficient also positive?
- What is the sign on the LSTAT coefficient? Does it match your intuition and the scatter plot above?
- Check the other coefficients. Do they have the expected sign?
- Based on the coefficients, how much more expensive is a home with 6 rooms compared to a home with 5 rooms? According to the model, what is the premium you would have to pay for an extra room?
regression.intercept_
36.53305138282431
regression.coef_
array([-1.28180656e-01, 6.31981786e-02, -7.57627602e-03, 1.97451452e+00,
-1.62719890e+01, 3.10845625e+00, 1.62922153e-02, -1.48301360e+00,
3.03988206e-01, -1.20820710e-02, -8.20305699e-01, 1.14189890e-02,
-5.81626431e-01])
regression.coef_.shape
(13,)
data.columns[:-1]
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT'],
dtype='object')
dict(zip(data.columns[:-1], regression.coef_))
{'CRIM': -0.12818065642264795,
'ZN': 0.06319817864608888,
'INDUS': -0.00757627601533797,
'CHAS': 1.9745145165622597,
'NOX': -16.271988951469734,
'RM': 3.1084562454033,
'AGE': 0.01629221534560711,
'DIS': -1.4830135966050273,
'RAD': 0.30398820612116106,
'TAX': -0.012082071043592574,
'PTRATIO': -0.8203056992885642,
'B': 0.011418989022213357,
'LSTAT': -0.581626431182139}
# For the last question, let's try a concrete example instead of pure maths.
# The RM coefficient suggests we have to pay about $3.1k for an extra room.
# Let's sample a home and change its number of rooms to check.
sample_x = data.sample(1)
print(sample_x)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \
379 17.87 0.00 18.10 0.00 0.67 6.22 100.00 1.39 24.00 666.00 20.20
B LSTAT PRICE
379 393.74 21.78 10.20
sample_x
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 379 | 17.87 | 0.00 | 18.10 | 0.00 | 0.67 | 6.22 | 100.00 | 1.39 | 24.00 | 666.00 | 20.20 | 393.74 | 21.78 | 10.20 |
sample_x = sample_x.drop(['PRICE'],axis=1)
sample_x
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 379 | 17.87 | 0.00 | 18.10 | 0.00 | 0.67 | 6.22 | 100.00 | 1.39 | 24.00 | 666.00 | 20.20 | 393.74 | 21.78 |
regression.predict(sample_x)
array([16.61196204])
sample_x['RM'] = 5.0
sample_x
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 379 | 17.87 | 0.00 | 18.10 | 0.00 | 0.67 | 5.00 | 100.00 | 1.39 | 24.00 | 666.00 | 20.20 | 393.74 | 21.78 |
regression.predict(sample_x)
array([12.81032005])
# In this run, dropping RM from 6.22 to 5.00 lowered the prediction by about 3.8, i.e. roughly $3.8k
sample_x['RM'] = 6.0
regression.predict(sample_x)
array([15.9187763])
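In a linear model, changing one feature by one unit while holding everything else fixed moves the prediction by exactly that feature's coefficient. The short check below uses the numbers printed above to confirm this for RM; the variable names are only for illustration.

```python
# Values copied from the regression output above (prices in $1000s)
rm_coefficient = 3.10845625   # coefficient on RM
pred_rm_5 = 12.81032005       # prediction with RM set to 5.0
pred_rm_6 = 15.9187763        # prediction with RM set to 6.0

premium = pred_rm_6 - pred_rm_5
print(f"Extra room premium: ${premium * 1000:,.0f}")
print(abs(premium - rm_coefficient) < 1e-6)   # the difference equals the coefficient
```

So according to this model, the premium for an extra room is about $3,100, regardless of which home you start from.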
Analyse the Estimated Values & Regression Residuals¶
The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions ($\hat y_i$) and the true values ($y_i$) inside y_train.
predicted_values = regression.predict(X_train)
residuals = (y_train - predicted_values)
Challenge: Create two scatter plots.
The first plot should be actual values (y_train) against the predicted values:
The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊
The second plot should be the residuals against the predicted prices. Here's what we're looking for:
y_train_predicted = regression.predict(X_train)
residuals = (y_train - y_train_predicted )
plt.figure(figsize=(8,4), dpi=200)
ax = sns.scatterplot(x=y_train,
y=y_train_predicted,)
# reference line: where the dots would sit if every prediction were perfect
plt.plot(y_train, y_train, color='cyan')
ax.set(xlabel='Real price',ylabel='Our predictions')
plt.show()
plt.figure(figsize=(8,4), dpi=200)
ax = sns.scatterplot(y=residuals,
x=y_train_predicted,
hue=residuals,)
ax.set(xlabel='Predicted price',ylabel='Divergence')
plt.show()
Why do we want to look at the residuals? We want to check that they look random. Why? The residuals represent the errors of our model. If there's a pattern in our errors, then our model has a systematic bias.
We can analyse the distribution of the residuals. In particular, we're interested in the skew and the mean.
In an ideal case, what we want is something close to a normal distribution. A normal distribution has a skewness of 0 and a mean of 0. A skew of 0 means that the distribution is symmetrical - the bell curve is not lopsided or biased to one side. Here's what a normal distribution looks like:
Challenge
- Calculate the mean and the skewness of the residuals.
- Again, use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE)
- Is the skewness different from zero? If so, by how much?
- Is the mean different from zero?
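As a rough guide to what the skewness number means, here is a standard-library sketch of the sample-size-adjusted skewness formula. To my understanding pandas' .skew() applies the same adjustment, but treat that as an assumption; the function name and the toy data are illustrative only. For an ordinary least squares fit with an intercept, the training residuals average out to essentially zero, so the skewness is the more informative of the two numbers.

```python
import math

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness (corrected for sample size)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                             # unadjusted skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]        # mean 0, balanced tails
right_tailed = [-1.0, -1.0, -1.0, -1.0, 4.0]   # mean 0, one large positive error
print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

Both toy lists have a mean of zero, yet only the symmetric one has zero skew. A positive skew like the second list means a few large positive errors, i.e. homes the model priced far too low.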
resid_mean = round(residuals.mean(), 2)
resid_skew = round(residuals.skew(), 2)
# displot replaces the deprecated distplot
sns.displot(x=residuals, kde=True, bins=20, stat="density", aspect=2)
plt.title(f"Distribution of Residuals: skew {resid_skew}, mean {resid_mean}")
plt.xlabel("Residuals")
plt.ylabel("Density")
plt.show()
Data Transformations for a Better Fit¶
We have two options at this point:
- Change our model entirely. Perhaps a linear model is not appropriate.
- Transform our data to make it fit better with our linear model.
Let's try a data transformation approach.
Challenge
Investigate if the target data['PRICE'] could be a suitable candidate for a log transformation.
- Use Seaborn's .displot() to show a histogram and KDE of the price data.
- Calculate the skew of that distribution.
- Use NumPy's log() function to create a Series that has the log prices
- Plot the log prices using Seaborn's .displot() and calculate the skew.
- Which distribution has a skew that's closer to zero?
# looks like log distribution has skew which is closer to zero
# displot replaces the deprecated distplot
sns.displot(x=data['PRICE'], kde=True, bins=20, stat="density", aspect=2)
plt.title("Distribution of Price")
plt.xlabel("Price")
plt.ylabel("Density")
plt.show()
print(f"Skewness: {round(data['PRICE'].skew(),3)}")
Skewness: 1.108
price_log = np.log(data['PRICE'])
# displot replaces the deprecated distplot
sns.displot(x=price_log, kde=True, bins=20, stat="density", aspect=2)
plt.title("Log Distribution of Price")
plt.xlabel("Log Price")
plt.ylabel("Density")
plt.show()
print(f"Skewness: {round(price_log.skew(),3)}")
Skewness: -0.33
How does the log transformation work?¶
Using a log transformation does not affect every price equally. Large prices are affected more than smaller prices in the dataset. Here's how the prices are "compressed" by the log transformation:
We can see this when we plot the actual prices against the (transformed) log prices.
plt.figure(dpi=150)
plt.scatter(data.PRICE, np.log(data.PRICE))
plt.title('Mapping the Original Price to a Log Price')
plt.ylabel('Log Price')
plt.xlabel('Actual $ Price in 000s')
plt.show()
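A couple of numbers make the compression easier to see. The sketch below takes two pairs of prices that are each $10k apart and compares the gaps after the log transformation; the price pairs are made up for illustration.

```python
import math

# Two pairs of prices (in $1000s), each pair $10k apart
low_pair = (5.0, 15.0)
high_pair = (40.0, 50.0)

low_gap = math.log(low_pair[1]) - math.log(low_pair[0])
high_gap = math.log(high_pair[1]) - math.log(high_pair[0])
print(round(low_gap, 3), round(high_gap, 3))
```

The same $10k difference becomes a much smaller gap at the expensive end of the range, which is why the long right tail of the price distribution gets pulled in.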
Regression using Log Prices¶
Using log prices instead, our model has changed to:
$$ \log (PR \hat ICE) = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta_3 DIS + \theta _4 CHAS + ... + \theta _{13} LSTAT $$
Challenge:
- Use train_test_split() with the same random state as before to make the results comparable.
- Run a second regression, but this time use the transformed target data.
- What is the r-squared of the regression on the training data?
- Have we improved the fit of our model compared to before based on this measure?
y = np.log(data['PRICE'])
X_train, X_test, y_train, y_test = train_test_split(X,y,
test_size=0.2,
random_state=10)
regression = LinearRegression()
regression.fit(X_train,y_train)
LinearRegression()
regression.score(X_test,y_test) # test r-squared rises from 0.67 to 0.74, nice
0.7446922306260739
Evaluating Coefficients with Log Prices¶
Challenge: Print out the coefficients of the new regression model.
- Do the coefficients still have the expected sign?
- Is being next to the river a positive based on the data?
- How does the quality of the schools affect property prices? What happens to prices as there are more students per teacher?
Hint: Use a DataFrame to make the output look pretty.
regression.coef_
array([-1.06717261e-02, 1.57929102e-03, 2.02989827e-03, 8.03305301e-02,
-7.04068057e-01, 7.34044072e-02, 7.63301755e-04, -4.76332789e-02,
1.45651350e-02, -6.44998303e-04, -3.47947628e-02, 5.15896157e-04,
-3.13900565e-02])
dict(zip(data.columns[:-1], regression.coef_))
{'CRIM': -0.010671726123550168,
'ZN': 0.0015792910237792527,
'INDUS': 0.0020298982724026335,
'CHAS': 0.0803305301283453,
'NOX': -0.7040680570150262,
'RM': 0.0734044072331749,
'AGE': 0.0007633017550354285,
'DIS': -0.04763327892124784,
'RAD': 0.014565134991367565,
'TAX': -0.0006449983030440104,
'PTRATIO': -0.03479476276651838,
'B': 0.0005158961569951322,
'LSTAT': -0.03139005646263394}
# CHAS is positive (0.08): in a log-price model that is roughly an 8% premium for being next to the river
# More students per teacher == lower price (the PTRATIO coefficient is negative)
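Because the target is now the log of the price, each coefficient describes a relative change rather than a dollar amount. A minimal sketch of the conversion, using three coefficients copied from the output above (the helper name percent_change is illustrative):

```python
import math

# Coefficients copied from the log-price regression output above
log_coefficients = {
    "CHAS": 0.0803305301,
    "RM": 0.0734044072,
    "PTRATIO": -0.0347947628,
}

def percent_change(coefficient):
    """Model-implied % change in price for a one-unit increase in the feature."""
    return (math.exp(coefficient) - 1) * 100

for name, coef in log_coefficients.items():
    print(f"{name}: {percent_change(coef):+.1f}%")
```

Read this way, the model prices a riverside location at a premium of a little over 8%, and each additional student per teacher at a discount of roughly 3%, holding the other features fixed.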
Regression with Log Prices & Residual Plots¶
Challenge:
- Copy-paste the cell where you've created scatter plots of the actual versus the predicted home prices as well as the residuals versus the predicted values.
- Add 2 more plots to the cell so that you can compare the regression outcomes with the log prices side by side.
- Use indigo as the colour for the original regression and navy for the colour using log prices.
y_train_predicted = regression.predict(X_train)
plt.figure(figsize=(8,4), dpi=200)
ax = sns.scatterplot(x=y_train,
y=y_train_predicted,)
# reference line: where the dots would sit if every prediction were perfect
plt.plot(y_train, y_train, color='cyan')
ax.set(xlabel='Real price',ylabel='Our predictions')
plt.show()
residuals = (y_train - y_train_predicted )
plt.figure(figsize=(8,4), dpi=200)
ax = sns.scatterplot(y=residuals,
x=y_train_predicted,
hue=residuals,)
ax.set(xlabel='Predicted price',ylabel='Divergence')
plt.show()
# i don't understand rest of this challenge
Challenge:
Calculate the mean and the skew for the residuals using log prices. Are the mean and skew closer to 0 for the regression using log prices?
# displot replaces the deprecated distplot
sns.displot(x=residuals, kde=True, bins=20, stat="density", aspect=2)
plt.title(f"Distribution of Residuals: skew {round(residuals.skew(), 2)}, mean {round(residuals.mean(), 2)}")
plt.xlabel("Residuals")
plt.ylabel("Density")
plt.show()
Compare Out of Sample Performance¶
The real test is how our model performs on data that it has not "seen" yet. This is where our X_test comes in.
Challenge
Compare the r-squared of the two models on the test dataset. Which model does better? Is the r-squared higher or lower than for the training dataset? Why?
regression.score(X_test,y_test) # same test-set score as computed earlier: 0.74 with log prices vs 0.67 for the original model
0.7446922306260739
Predict a Property's Value using the Regression Coefficients¶
Our preferred model now has an equation that looks like this:
$$ \log (PR \hat ICE) = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta_3 DIS + \theta _4 CHAS + ... + \theta _{13} LSTAT $$
The average property has the mean value for all its characteristics:
# Starting Point: Average Values in the Dataset
features = data.drop(['PRICE'], axis=1)
average_vals = features.mean().values
property_stats = pd.DataFrame(data=average_vals.reshape(1, len(features.columns)),
columns=features.columns)
property_stats
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 |
Challenge
Predict how much the average property is worth using the stats above. What is the log price estimate and what is the dollar estimate? You'll have to reverse the log transformation with .exp() to find the dollar value.
predicted_avg = regression.predict(property_stats)
np.exp(predicted_avg) * 1000 # price in $
array([20703.17832102])
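The conversion from a log estimate back to dollars can be checked with plain arithmetic. In this sketch the dollar figure is copied from the output above and the log value is derived from it, since the raw log prediction is not printed in the notebook.

```python
import math

# Dollar estimate copied from the output above; the log value is derived from it
dollar_estimate = 20703.17832102
log_estimate = math.log(dollar_estimate / 1000)   # the scale regression.predict() works on

recovered = math.exp(log_estimate) * 1000         # undo the log, then convert $1000s to $
print(round(log_estimate, 2), round(recovered, 2))
```

The log estimate is a little over 3, and exponentiating it and multiplying by 1000 recovers the dollar value of about $20,700.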
Challenge
Keeping the average values for CRIM, RAD, INDUS and others, value a property with the following characteristics:
# Define Property Characteristics
next_to_river = True
nr_rooms = 8
students_per_classroom = 20
distance_to_town = 5
pollution = data.NOX.quantile(q=0.75) # high
amount_of_poverty = data.LSTAT.quantile(q=0.25) # low
# Solution:
particular_property_example = property_stats.copy()
particular_property_example['CHAS'] = int(next_to_river)
particular_property_example['RM'] = nr_rooms
particular_property_example['PTRATIO'] = students_per_classroom
particular_property_example['DIS'] = distance_to_town
particular_property_example['NOX'] = pollution
particular_property_example['LSTAT'] = amount_of_poverty
predicted_price_of_example = regression.predict(particular_property_example)
np.exp(predicted_price_of_example) * 1000 # price in $
array([25792.0258724])
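To see where the higher valuation comes from, the change in the log prediction can be broken down feature by feature: each coefficient times the difference between the new value and the dataset mean. The sketch below does this by hand. The coefficients come from the log-price regression output above, and the means and quartiles are the rounded figures from data.describe(), so the result is only approximate.

```python
import math

changes = {
    # feature: (coefficient, new value, rounded dataset mean)
    "CHAS":    (0.0803305301, 1.0, 0.07),
    "RM":      (0.0734044072, 8.0, 6.28),
    "PTRATIO": (-0.0347947628, 20.0, 18.46),
    "DIS":     (-0.0476332789, 5.0, 3.80),
    "NOX":     (-0.7040680570, 0.62, 0.55),   # 75th percentile vs mean
    "LSTAT":   (-0.0313900565, 6.95, 12.65),  # 25th percentile vs mean
}

# Shift in log price relative to the average property
log_shift = sum(coef * (new - mean) for coef, new, mean in changes.values())
model_ratio = 25792.0258724 / 20703.17832102   # the two dollar estimates above

print(round(math.exp(log_shift), 2), round(model_ratio, 2))
```

Both figures come out at roughly 1.25, i.e. this property is valued about 25% above the average one. The extra rooms, the low LSTAT and the river location push the price up, while the higher pollution, the larger distance and the fuller classrooms pull it down.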